{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/llm/nvidia_triton.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Nvidia Triton"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[NVIDIA Triton Inference Server](https://github.com/triton-inference-server/server) provides a cloud and edge inferencing solution optimized for both CPUs and GPUs. This connector allows for llama_index to remotely interact with TRT-LLM models deployed with Triton."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Launching Triton Inference Server"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This connector requires a running instance of Triton Inference Server with A TensorRT-LLM model.\n",
    "For this example, we will use a [Triton Command Line Interface (Triton CLI)](https://github.com/triton-inference-server/triton_cli) to deploy a GPT2 model on Triton.\n",
    "\n",
    "When using Triton and related tools on your host (outside of a Triton container image) there are a number of additional dependencies that may be required for various workflows. Most system dependency issues can be resolved by installing and running the CLI from within the latest corresponding `tritonserver` container image, which should have all necessary system dependencies installed.\n",
    "\n",
    "For TRT-LLM, you can use `nvcr.io/nvidia/tritonserver:{YY.MM}-trtllm-python-py3` image, where `YY.MM` corresponds to the version of `tritonserver`, for example in this example we're using 24.02 version of the container. To get the list of available versions, please refer to [Triton Inference Server NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver).\n",
    "\n",
    "To start the container, run in your Linux terminal:\n",
    "\n",
    "```\n",
    "docker run -ti --gpus all --network=host --shm-size=1g --ulimit memlock=-1 nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3\n",
    "```\n",
    "Next, we'll need to install dependencies with the following:\n",
    "```\n",
    "pip install \\\n",
    "  \"psutil\" \\\n",
    "  \"pynvml>=11.5.0\" \\\n",
    "  \"torch==2.1.2\" \\\n",
    "  \"tensorrt_llm==0.8.0\" --extra-index-url https://pypi.nvidia.com/\n",
    "```\n",
    "Finally, run the following to install Triton CLI."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```\n",
    "pip install git+https://github.com/triton-inference-server/triton_cli.git\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To generate model repository for GPT2 model and start an instance of Triton Server:\n",
    "```\n",
    "triton remove -m all\n",
    "triton import -m gpt2 --backend tensorrtllm\n",
    "triton start &\n",
    "```\n",
    "Please, note that by default Triton starts listenning to `localhost:8000` HTTP port and `localhost:8001` GRPC port. The latter will be used in this example.\n",
    "For any additional how-tos and questions, please reach out to [Triton Command Line Interface (Triton CLI)](https://github.com/triton-inference-server/triton_cli) issues.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Install tritonclient\n",
    "Since we are interacting with the Triton Inference Server we will need to [install](https://github.com/triton-inference-server/client?tab=readme-ov-file#download-using-python-package-installer-pip) the `tritonclient` package."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```\n",
    "pip install tritonclient[all]\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we'll install llama index connector.\n",
    "```\n",
    "pip install llama-index-llms-nvidia-triton\n",
    "```\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Basic Usage"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Call `complete` with a prompt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```python\n",
    "from llama_index.llms.nvidia_triton import NvidiaTriton\n",
    "\n",
    "# A Triton server instance must be running. Use the correct URL for your desired Triton server instance.\n",
    "triton_url = \"localhost:8001\"\n",
    "model_name = \"gpt2\"\n",
    "resp = NvidiaTriton(server_url=triton_url, model_name=model_name, tokens=32).complete(\"The tallest mountain in North America is \")\n",
    "print(resp)\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You should expect the following response\n",
    "```\n",
    "the Great Pyramid of Giza, which is about 1,000 feet high. The Great Pyramid of Giza is the tallest mountain in North America.\n",
    "```\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Call `stream_complete` with a prompt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```python\n",
    "resp = NvidiaTriton(server_url=triton_url, model_name=model_name, tokens=32).stream_complete(\"The tallest mountain in North America is \")\n",
    "for delta in resp:\n",
    "    print(delta.delta, end=\" \")\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You should expect the following response as a stream\n",
    "```\n",
    "the Great Pyramid of Giza, which is about 1,000 feet high. The Great Pyramid of Giza is the tallest mountain in North America.\n",
    "```\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Further Examples\n",
    "For more information on Triton Inference Server, please refer to a [Quickstart](https://github.com/triton-inference-server/server/blob/main/docs/getting_started/quickstart.md#quickstart) guide, [NVIDIA Developer Triton page](https://developer.nvidia.com/triton-inference-server), and [GitHub issues](https://github.com/triton-inference-server/server/issues) channel."
   ]
  }
 ],
 "metadata": {
  "colab": {
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  },
  "vscode": {
   "interpreter": {
    "hash": "a0a0263b650d907a3bfe41c0f8d6a63a071b884df3cfdc1579f00cdc1aed6b03"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
